Abstract

We  give  a  new  class  of  outer  bounds  on  the  marginal  polytope,  and  propose  a 
cutting-plane  algorithm  for  efficiently  optimizing  over  these  constraints.  When 
combined  with  a  concave  upper  bound  on  the  entropy,  this  gives  a  new  variational 
inference  algorithm  for  probabilistic  inference  in  discrete  Markov  Random  Fields 
(MRFs).  Valid  constraints  on  the  marginal  polytope  are  derived  through  a  series 
of  projections  onto  the  cut  polytope.  As  a  result,  we  obtain  tighter  upper  bounds 
on  the  log-partition  function.  We  also  show  empirically  that  the  approximations  of 
the  marginals  are  significantly  more  accurate  when  using  the  tighter  outer  bounds. 
Finally,  we  demonstrate  the  advantage  of  the  new  constraints  for  finding  the  MAP 
assignment  in  protein  structure  prediction. 


1  Introduction 

Graphical models such as Markov Random Fields (MRFs) have been successfully applied to a wide
variety of fields, from computer vision to computational biology. From the point of view of
inference, we are generally interested in two questions: finding the marginal probabilities of specific
subsets of the variables, and finding the Maximum a Posteriori (MAP) assignment. Both of these
require approximate methods.

We focus on a particular class of variational approximation methods that cast the inference problem
as a non-linear optimization over the marginal polytope, the set of valid marginal probabilities. The
selection of appropriate marginals from the marginal polytope is guided by the (non-linear) entropy
function. Both the marginal polytope and the entropy are difficult to characterize in general,
reflecting the hardness of exact inference calculations. Most message-passing algorithms for evaluating
marginals, including belief propagation and tree-reweighted sum-product (TRW), operate instead
within the local consistency polytope, characterized by pairwise consistent marginals. For general
graphs, this is an outer bound of the marginal polytope. Various approximations have also been
suggested for the entropy function. For example, in the TRW algorithm [10], the entropy is decomposed
into a weighted combination of entropies of tree-structured distributions.

Our  goal  here  is  to  provide  tighter  outer  bounds  on  the  marginal  polytope.  We  show  how  this  can 
be  achieved  efficiently  using  a  cutting-plane  algorithm,  iterating  between  solving  a  relaxed  problem 
and  adding  additional  constraints.  Cutting-plane  algorithms  are  a  well-known  technique  for  solving 
integer  linear  programs.  The  key  to  such  approaches  is  to  have  an  efficient  separation  algorithm 
which,  given  an  infeasible  solution,  can  quickly  find  a  violated  constraint,  generally  from  a  very 
large  class  of  valid  constraints  on  the  set  of  integral  solutions. 

The motivation for our approach comes from the cutting-plane literature for the maximum cut
problem. Barahona et al. [3] showed that the MAP problem in pairwise binary MRFs is equivalent to a
linear optimization over the cut polytope, which is the convex hull of all valid graph cuts. Tighter
relaxations were obtained by using a separation algorithm together with the cutting-plane
methodology. We extend this work by deriving a new class of outer bounds on the marginal polytope for
non-binary and non-pairwise MRFs. The key realization is that valid constraints can be constructed
by a series of projections onto the cut polytope^1. More broadly, we seek to highlight emerging
connections between polyhedral combinatorics and probabilistic inference.

2  Background 

Markov Random Fields. Let $x \in \chi^n$ denote a random vector on $n$ variables, where, for simplicity,
each variable $x_i$ takes on the values in $\chi_i = \{0, 1, \ldots, k-1\}$. The MRF is specified by a set of $d$
real valued potentials or sufficient statistics $\phi(x) = \{\phi_k(x)\}$ and a parameter vector $\theta \in \mathbb{R}^d$:

$$p(x; \theta) = \exp\{\langle \theta, \phi(x) \rangle - A(\theta)\}, \qquad A(\theta) = \log \sum_{x \in \chi^n} \exp\{\langle \theta, \phi(x) \rangle\}$$

where $\langle \theta, \phi(x) \rangle$ denotes the dot product of the parameters and the sufficient statistics. In pairwise
MRFs, potentials are restricted to be at most over the edges in the graph. We assume that the
potentials are indicator functions, i.e., $\phi_{i;s}(x) = \delta(x_i = s)$, and make use of the following notation:

$$\mu_{i;s} = E_\theta[\phi_{i;s}(x)] = p(x_i = s; \theta) \quad \text{and} \quad \mu_{ij;st} = E_\theta[\phi_{ij;st}(x)] = p(x_i = s, x_j = t; \theta).$$
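To make the definitions concrete, the following is a minimal brute-force sketch that enumerates all assignments of a tiny pairwise MRF with indicator potentials and returns $A(\theta)$ together with the singleton marginals $\mu_{i;s}$. The function name and the dictionary layout of the parameters are illustrative assumptions, not part of the paper, and the enumeration is only feasible for very small $n$.

```python
import itertools
import math

def log_partition_and_marginals(n, k, theta_node, theta_edge):
    """Brute-force A(theta) and singleton marginals for a tiny pairwise MRF.

    theta_node[i][s] is the parameter of the indicator x_i = s.
    theta_edge[(i, j)][s][t] is the parameter of the indicator x_i = s, x_j = t.
    (This layout is an assumption made for the sketch.)
    """
    scores = {}
    for x in itertools.product(range(k), repeat=n):
        score = sum(theta_node[i][x[i]] for i in range(n))
        score += sum(th[x[i]][x[j]] for (i, j), th in theta_edge.items())
        scores[x] = score  # <theta, phi(x)>
    # log-sum-exp for numerical stability
    top = max(scores.values())
    A = top + math.log(sum(math.exp(s - top) for s in scores.values()))
    mu = [[0.0] * k for _ in range(n)]
    for x, s in scores.items():
        p = math.exp(s - A)  # p(x; theta)
        for i in range(n):
            mu[i][x[i]] += p
    return A, mu
```

With all parameters set to zero the distribution is uniform, so for two binary variables the sketch returns $A(\theta) = \log 4$ and marginals of 0.5.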

Variational inference. The inference task is to evaluate the mean vector $\mu = E_\theta[\phi(x)]$. The
log-partition function $A(\theta)$, a convex function of $\theta$, plays a critical role in these calculations. In
particular, we can write the log-partition function in terms of its Fenchel-Legendre conjugate [11]:

$$A(\theta) = \sup_{\mu \in \mathcal{M}} \{\langle \theta, \mu \rangle - B(\mu)\}, \qquad (1)$$

where $B(\mu) = -H(\mu)$ is the negative entropy of the distribution parameterized by $\mu$ and is also
convex. $\mathcal{M}$ is the set of realizable mean vectors $\mu$ known as the marginal polytope. More precisely,
$\mathcal{M} := \{\mu \in \mathbb{R}^d \mid \exists\, p(x) \text{ s.t. } \mu = E_p[\phi(x)]\}$. The value $\mu^* \in \mathcal{M}$ that maximizes (1) is precisely the
desired mean vector corresponding to $p(x; \theta)$.
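As a small sanity check of the variational representation (1), consider a single binary variable with sufficient statistic $\phi(x) = x$, so that $A(\theta) = \log(1 + e^\theta)$ and $\mathcal{M} = [0, 1]$. The sketch below maximizes $\theta \mu + H(\mu)$ over a grid; the grid search and the function names are assumptions chosen for illustration only.

```python
import math

def entropy(mu):
    """Entropy of a Bernoulli(mu) variable, using the convention 0 log 0 = 0."""
    return -sum(p * math.log(p) for p in (mu, 1.0 - mu) if p > 0.0)

def variational_value(theta, grid=10001):
    """Maximize theta * mu + H(mu) over an evenly spaced grid on [0, 1].

    Returns (best objective value, maximizing mu).
    """
    best_val, best_mu = -math.inf, None
    for i in range(grid):
        mu = i / (grid - 1)
        val = theta * mu + entropy(mu)
        if val > best_val:
            best_val, best_mu = val, mu
    return best_val, best_mu
```

Up to the grid resolution, the maximum value agrees with $\log(1 + e^\theta)$ and the maximizer agrees with the true mean $p(x = 1; \theta) = 1 / (1 + e^{-\theta})$.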

Both $\mathcal{M}$ and the entropy $H(\mu)$ are difficult to characterize in general and have to be approximated.
We call the resulting approximate mean vectors pseudomarginals. Mean field algorithms optimize
over an inner bound on the marginal polytope (which is not convex) by restricting the marginal
vectors to those coming from simpler, e.g., fully factored, distributions. The entropy can be evaluated
exactly in this case (the distribution is simple). Alternatively, we can relax the optimization to be
over an outer bound on the marginal polytope and also bound the entropy function.

Most message passing algorithms for evaluating marginal probabilities obtain locally consistent
beliefs so that the pseudomarginals over the edges agree with the singleton pseudomarginals at the
nodes. The solution is therefore sought within the local marginal polytope

$$\mathrm{LOCAL}(G) = \Big\{ \mu \ge 0 \;\Big|\; \sum_{s \in \chi_i} \mu_{i;s} = 1, \;\; \sum_{t \in \chi_j} \mu_{ij;st} = \mu_{i;s} \Big\} \qquad (2)$$

Clearly, $\mathcal{M} \subseteq \mathrm{LOCAL}(G)$ since true marginals are also locally consistent. For trees, $\mathcal{M} =
\mathrm{LOCAL}(G)$. Both $\mathrm{LOCAL}(G)$ and $\mathcal{M}$ have the same integral vertices for general graphs [11, 6].
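The gap between the two polytopes can be seen on a triangle. The sketch below checks the local consistency constraints for a candidate pseudomarginal vector; the data layout (dictionaries keyed by node and by edge) is an assumption for illustration. The example vector places all edge mass on disagreement, which is locally consistent even though no joint distribution over three binary variables can make all three pairs disagree with probability one.

```python
def is_locally_consistent(mu_node, mu_edge, tol=1e-9):
    """Check the constraints defining LOCAL(G) for a pairwise MRF.

    mu_node[i][s] plays the role of mu_{i;s};
    mu_edge[(i, j)][s][t] plays the role of mu_{ij;st}.
    """
    for row in mu_node.values():
        if min(row) < -tol or abs(sum(row) - 1.0) > tol:
            return False
    for (i, j), table in mu_edge.items():
        if any(v < -tol for r in table for v in r):
            return False
        # summing out x_j must reproduce the marginal of x_i
        for s, r in enumerate(table):
            if abs(sum(r) - mu_node[i][s]) > tol:
                return False
        # summing out x_i must reproduce the marginal of x_j
        for t in range(len(table[0])):
            if abs(sum(r[t] for r in table) - mu_node[j][t]) > tol:
                return False
    return True

# Triangle on which every edge puts all of its mass on disagreement.
nodes = {i: [0.5, 0.5] for i in range(3)}
anti = [[0.0, 0.5], [0.5, 0.0]]
edges = {(0, 1): anti, (1, 2): anti, (0, 2): anti}
```

Here `is_locally_consistent(nodes, edges)` holds, so this fractional point lies in LOCAL(G) but outside the marginal polytope.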

Belief propagation can be seen as optimizing pseudomarginals over LOCAL(G) with a (non-convex)
Bethe approximation to the entropy [15]. The tree-reweighted sum-product algorithm [10], on the
other hand, uses a concave upper bound on the entropy, expressed as a convex combination of
entropies corresponding to the spanning trees of the original graph. The log-determinant relaxation
[12] is instead based on a semi-definite outer bound on the marginal polytope combined with a
Gaussian approximation to the entropy function. Since the moment matrix $M_1(\mu)$ can be written
as $E_\theta[(1\ x)^T (1\ x)]$ for $\mu \in \mathcal{M}$, the outer bound is obtained simply by requiring only that the
pseudomarginals lie in $\mathrm{SDEF}_1(K_n) = \{\mu \in \mathbb{R}^d_+ \mid M_1(\mu) \succeq 0\}$.

Maximum a posteriori. The marginal polytope also plays a critical role in finding the MAP
assignment. The problem is to find an assignment $x \in \chi^n$ which maximizes $p(x; \theta)$, or equivalently:

$$\max_{x \in \chi^n} \log p(x; \theta) = \max_{x \in \chi^n} \langle \theta, \phi(x) \rangle - A(\theta) = \sup_{\mu \in \mathcal{M}} \langle \theta, \mu \rangle - A(\theta) \qquad (3)$$

where the log-partition function $A(\theta)$ remains a constant and can be ignored. The last equality holds
because the optimal value of the linear program is obtained at a vertex (integral solution). That is,
when the MAP assignment $x^*$ is unique, the maximizing $\mu^*$ is $\phi(x^*)$.
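The equality in (3) relies on optimizing over the exact marginal polytope; over a looser outer bound the linear objective can exceed the best integral value. The following sketch illustrates this on a binary triangle whose (assumed, illustrative) edge parameters reward disagreement. It compares the brute-force MAP value with the objective at a fractional point that satisfies the local consistency constraints (all node marginals 0.5, all edge mass on disagreement).

```python
import itertools

def objective(theta_edge, mu_edge):
    """<theta, mu> restricted to the edge indicator potentials."""
    return sum(
        theta_edge[e][s][t] * mu_edge[e][s][t]
        for e in theta_edge for s in (0, 1) for t in (0, 1)
    )

def best_integral(n, theta_edge):
    """Brute-force MAP value over all binary assignments."""
    return max(
        sum(th[x[i]][x[j]] for (i, j), th in theta_edge.items())
        for x in itertools.product((0, 1), repeat=n)
    )

# Each edge scores 1 when its endpoints disagree (hypothetical parameters).
disagree = [[0.0, 1.0], [1.0, 0.0]]
theta = {(0, 1): disagree, (1, 2): disagree, (0, 2): disagree}

# Fractional, locally consistent point: every edge always disagrees.
anti = [[0.0, 0.5], [0.5, 0.0]]
fractional = {e: anti for e in theta}
```

At most two edges of a triangle can be cut by any assignment, so `best_integral(3, theta)` is 2, while the fractional point attains 3. Constraints that cut off such points are what the remainder of the paper develops.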

1 For reasons of clarity, our results will be given in terms of the binary marginal polytope, also called the
correlation polytope, which is equivalent to the cut polytope of the suspension graph of the MRF [6].
Algorithm 1 Cutting-plane algorithm for probabilistic inference

1: OUTER ← LOCAL(G)
2: repeat
3:     $\mu^* \leftarrow \arg\max_{\mu \in \mathrm{OUTER}} \{\langle \theta, \mu \rangle - B^*(\mu)\}$
4:     Choose projection graph $G_\pi$, e.g. single, k, or full
5:     C ← Find_Violated_Inequalities($G_\pi$, $\Psi_\pi(\mu^*)$)
6:     OUTER ← OUTER ∩ C
7: until $C = \mathbb{R}^d$ (did not find any violated inequalities)


Cycle inequalities. The marginal polytope can be defined by the intersection of a large number of
linear inequalities. We focus on inequalities beyond those specifying LOCAL(G), in particular the
cycle inequalities [4, 2, 6]. Assume the variables are binary. Given an assignment $x \in \{0, 1\}^n$,
$(i, j) \in E$ is cut if $x_i \ne x_j$. The cycle inequalities arise from the observation that a cycle must
have an even (possibly zero) number of cut edges. Suppose we start at node $i$, where $x_i = 0$. As we
traverse the cycle, the assignment changes each time we cross a cut edge. Since we must return to
$x_i = 0$, the assignment can only change an even number of times. For a cycle $C$ and any $F \subseteq C$ such
that $|F|$ is odd, this constraint can be written as $\sum_{(i,j) \in C \setminus F} \delta(x_i \ne x_j) + \sum_{(i,j) \in F} \delta(x_i = x_j) \ge 1$.
Since this constraint is valid for all assignments $x \in \{0, 1\}^n$, it holds also in expectation. Thus

$$\sum_{(i,j) \in C \setminus F} (\mu_{ij;10} + \mu_{ij;01}) + \sum_{(i,j) \in F} (\mu_{ij;00} + \mu_{ij;11}) \ge 1 \qquad (4)$$

is valid for any $\mu \in \mathcal{M}_{\{0,1\}}$, the marginal polytope of a binary pairwise MRF. For a chordless
circuit $C$, the cycle inequalities are facets of $\mathcal{M}_{\{0,1\}}$ [4]. They suffice to characterize $\mathcal{M}_{\{0,1\}}$ for a
graph $G$ if and only if $G$ has no $K_4$-minor. Although there are exponentially many cycles and cycle
inequalities for a graph, Barahona and Mahjoub [4, 6] give a simple algorithm to separate the whole
class of cycle inequalities.

To see whether any cycle inequality is violated, construct the undirected graph $G' = (V', E')$ where
$V'$ contains nodes $i_1$ and $i_2$ for each $i \in V$, and for each $(i, j) \in E$, the edges in $E'$ are: $(i_1, j_1)$ and
$(i_2, j_2)$ with weight $\mu_{ij;10} + \mu_{ij;01}$, and $(i_1, j_2)$ and $(i_2, j_1)$ with weight $\mu_{ij;00} + \mu_{ij;11}$. Then, for
each node $i \in V$ we find the shortest path in $G'$ from $i_1$ to $i_2$. The shortest of all these paths will not
use both copies of any node $j$ (otherwise the path $j_1$ to $j_2$ would be shorter), and so defines a cycle in
$G$ and gives the minimum value of $\sum_{(i,j) \in C \setminus F} (\mu_{ij;10} + \mu_{ij;01}) + \sum_{(i,j) \in F} (\mu_{ij;00} + \mu_{ij;11})$. If this
is less than 1, we have found a violated cycle inequality; otherwise, $\mu$ satisfies all cycle inequalities.
Using Dijkstra's shortest paths algorithm with a Fibonacci heap [5], the separation problem can be
solved in time $O(n^2 \log n + n|E|)$.
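The separation procedure just described can be sketched as follows. This is an illustrative implementation under stated assumptions: it uses Python's binary heap (`heapq`) rather than a Fibonacci heap, so it does not attain the stated running time, and the function name and the 2x2 table layout for edge pseudomarginals are choices made for the sketch. Copy 0 and copy 1 of a node correspond to $i_1$ and $i_2$.

```python
import heapq

def most_violated_cycle(n, mu_edge):
    """Shortest-path separation of cycle inequalities (binary pairwise MRF).

    mu_edge[(i, j)] is a 2x2 table [[mu00, mu01], [mu10, mu11]].
    Returns (value, path): the smallest left-hand side of (4) found and the
    corresponding path of (node, copy) pairs in the doubled graph G'.
    """
    adj = {(i, c): [] for i in range(n) for c in (0, 1)}
    for (i, j), t in mu_edge.items():
        cut = t[0][1] + t[1][0]    # weight of edges that stay in the same copy
        uncut = t[0][0] + t[1][1]  # weight of edges that switch copies
        for c in (0, 1):
            adj[(i, c)].append(((j, c), cut))
            adj[(j, c)].append(((i, c), cut))
            adj[(i, c)].append(((j, 1 - c), uncut))
            adj[(j, c)].append(((i, 1 - c), uncut))
    best = (float("inf"), None)
    for i in range(n):
        # Dijkstra from copy 0 of node i
        dist, prev = {(i, 0): 0.0}, {}
        heap = [(0.0, (i, 0))]
        while heap:
            d, u = heapq.heappop(heap)
            if d > dist.get(u, float("inf")):
                continue  # stale heap entry
            for v, w in adj[u]:
                if d + w < dist.get(v, float("inf")):
                    dist[v], prev[v] = d + w, u
                    heapq.heappush(heap, (d + w, v))
        if dist.get((i, 1), float("inf")) < best[0]:
            path, u = [(i, 1)], (i, 1)
            while u != (i, 0):
                u = prev[u]
                path.append(u)
            best = (dist[(i, 1)], path[::-1])
    return best
```

A returned value below 1 indicates a violated cycle inequality; the edges that switch copies along the path form the set $F$. For a triangle whose edge tables put all mass on disagreement, the value is 0, while for uniform edge tables it is 1 and nothing is violated.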

3  Cutting-plane  algorithm 

Our main result is the proposed Algorithm 1 given above. The algorithm alternates between
solving for an upper bound of the log-partition function (see Eq. 1) and tightening the outer bound on
the marginal polytope by incorporating valid constraints that are violated by the current
pseudomarginals. The projection graph (line 4) is not needed for binary pairwise MRFs and will be
described in the next section. We start the algorithm (line 1) with the loose outer bound on the marginal
polytope given by the local consistency constraints. Tighter initial constraints, e.g., $M_1(\mu) \succeq 0$, are
possible as well.

The separation algorithm returns a feasible set C given by the intersection of halfspaces, and we
intersect this with OUTER to obtain a smaller feasible space, i.e. a tighter relaxation. The experiments
in Section 5 use the separation algorithm for cycle inequalities. However, any class of valid
constraints for the marginal polytope with an efficient separation algorithm may be used in line 5. Other
examples besides the cycle inequalities include the odd-wheel and bicycle odd-wheel inequalities
[6], and also linear inequalities that enforce positive semi-definiteness of $M_1(\mu)$. The cutting-plane
algorithm is in effect optimizing the variational objective (Eq. 1) over a relaxation of the marginal
polytope defined by the intersection of all inequalities that can be returned in line 5.

Any entropy approximation $B^*(\mu)$ can be used so long as we can efficiently solve the optimization
problem in line 3. The log-determinant and TRW entropy approximations have two appealing
features. First, as upper bounds on the entropy they permit the algorithm to be used for obtaining
tighter upper bounds on the log-partition function. Second, the objective functions to be maximized
are concave, so the maximization can be solved efficiently using conditional gradient or other methods.

Figure 1: Illustration of the projection $\Psi_\pi$ for one edge $(i, j) \in E$ where $\chi_i = \{0, 1, 2\}$ and
$\chi_j = \{0, 1, 2, 3\}$. The projection graph $G_\pi$, shown on the right, has 3 partitions for $i$ and 7 for $j$.
[Figure not reproduced. Partition nodes for $i$: {0,1}{2}, {0,2}{1}, {1,2}{0}. Partition nodes for $j$:
{0}{1,2,3}, {1}{0,2,3}, {2}{0,1,3}, {3}{0,1,2}, {0,1}{2,3}, {0,2}{1,3}, {0,3}{1,2}.]

When the algorithm terminates, we can use the last $\mu^*$ vector as an approximation to the single node
and edge marginals. The results given in Section 5 use this method. The algorithm for MAP is the
same, excluding the entropy function in line 3; the optimization is simply a linear program. Since all
integral vectors in the relaxation OUTER are extreme points of the marginal polytope, any integral
$\mu^*$ is the MAP assignment.

4  Generalization  to  non-binary  MRFs 

In this section we give a new class of valid inequalities for the marginal polytope of non-binary and
non-pairwise MRFs, and show how to efficiently separate this exponentially large set of inequalities.
The key theoretical idea is to project the marginal polytope onto different binary marginal polytopes.
Aggregation and projection are well-known techniques in polyhedral combinatorics for obtaining
valid inequalities [6]. Given a linear projection $\Phi(x) = Ax$, any valid inequality $c^T \Phi(x) \le b$ for
$\Phi(x)$ also gives the valid inequality $c^T A x \le b$ for $x$. We obtain new inequalities by aggregating the
values of each variable.

For each variable $i$, let $\pi_i^a$ be a partition of its values into two non-empty sets, i.e., the map $\pi_i^a :
\chi_i \to \{0, 1\}$ is surjective. Let $\pi_i = \{\pi_i^1, \pi_i^2, \ldots\}$ be a collection of partitions of variable $i$. Define
the projection graph $G_\pi = (V_\pi, E_\pi)$ so that there is a node for each $\pi_i^a \in \pi_i$ and nodes $\pi_i^a$ and
$\pi_j^b$ are connected if $(i, j) \in E$. We call the graph consisting of all possible variable partitions the
full projection graph. In Figure 1 we show part of the full projection graph corresponding to one
edge $(i, j)$, where $x_i$ has three values and $x_j$ has four values. Intuitively, a partition for a variable
splits its values into two clusters, resulting in a binary variable. For example, the (new) variable
corresponding to the partition {0,1}{2} of $x_i$ is 1 if $x_i = 2$, and 0 otherwise. The following gives
a projection of marginal vectors of non-binary MRFs onto the marginal polytope of the projection
graph $G_\pi$, which has binary variables for each partition.

Definition 1. The linear map $\Psi_\pi$ takes $\mu \in \mathcal{M}$ and for each node $v = \pi_i^a \in V_\pi$ assigns
$\mu'_{v;1} = \sum_{s \in \chi_i \text{ s.t. } \pi_i^a(s) = 1} \mu_{i;s}$ and for each edge $e = (\pi_i^a, \pi_j^b) \in E_\pi$ assigns
$\mu'_{e;11} = \sum_{s_i \in \chi_i, s_j \in \chi_j \text{ s.t. } \pi_i^a(s_i) = \pi_j^b(s_j) = 1} \mu_{ij;s_i s_j}$.

To construct valid inequalities for each projection we need to characterize the image space. Let
$\mathcal{M}_{\{0,1\}}(G_\pi)$ denote the binary marginal polytope of the projection graph.

Theorem 1. The image of the projection $\Psi_\pi$ is $\mathcal{M}_{\{0,1\}}(G_\pi)$, i.e. $\Psi_\pi : \mathcal{M} \to \mathcal{M}_{\{0,1\}}(G_\pi)$.

Proof. Since $\Psi_\pi$ is a linear map, it suffices to show that, for every extreme point $\mu \in \mathcal{M}$, $\Psi_\pi(\mu) \in
\mathcal{M}_{\{0,1\}}(G_\pi)$. The extreme points of $\mathcal{M}$ correspond one-to-one with assignments $x \in \chi^n$. Given an
extreme point $\mu \in \mathcal{M}$ and variable $v = \pi_i^a \in V_\pi$, define $x'(\mu)_v = \sum_{s \in \chi_i \text{ s.t. } \pi_i^a(s) = 1} \mu_{i;s}$. Since $\mu$
is an extreme point, $\mu_{i;s} = 1$ for exactly one value $s$, which implies that $x'(\mu) \in \{0, 1\}^{|V_\pi|}$. Then,
$\Psi_\pi(\mu) = E[\phi(x'(\mu))]$, showing that $\Psi_\pi(\mu) \in \mathcal{M}_{\{0,1\}}(G_\pi)$. □
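The map of Definition 1 is simple to compute. The sketch below applies it to a marginal vector given a list of partitions, one per node of the projection graph; the function name and the data layout are assumptions made for illustration, and an edge of the projection graph is created only when the pair of underlying variables appears as a key of `mu_edge` in that order.

```python
def project(mu_node, mu_edge, partitions):
    """Apply the linear map of Definition 1.

    mu_node[i][s] is mu_{i;s}; mu_edge[(i, j)][s][t] is mu_{ij;st}.
    partitions is a list of (i, pi) pairs, one per node of the projection
    graph, where pi[s] in {0, 1} labels value s of variable i.
    Returns (node_proj, edge_proj) holding mu'_{v;1} and mu'_{e;11}.
    """
    node_proj = {}
    for v, (i, pi) in enumerate(partitions):
        node_proj[v] = sum(m for s, m in enumerate(mu_node[i]) if pi[s] == 1)
    edge_proj = {}
    for a, (i, pi) in enumerate(partitions):
        for b, (j, pj) in enumerate(partitions):
            if (i, j) in mu_edge:
                table = mu_edge[(i, j)]
                edge_proj[(a, b)] = sum(
                    table[s][t]
                    for s in range(len(table))
                    for t in range(len(table[0]))
                    if pi[s] == 1 and pj[t] == 1
                )
    return node_proj, edge_proj
```

Applied to an extreme point, for example the assignment $x_i = 2$, $x_j = 1$ with $\chi_i = \{0,1,2\}$ and $\chi_j = \{0,1,2,3\}$, every projected entry is 0 or 1, in line with the argument used in the proof of Theorem 1.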

This result allows valid inequalities for $\mathcal{M}_{\{0,1\}}(G_\pi)$ to carry over to $\mathcal{M}$. In general, the
projection $\Psi_\pi$ will not be surjective. Suppose every variable has $k$ values. The single projection graph,
where $|\pi_i| = 1$ for all $i$, has one node per variable and is surjective. The full projection graph has
$O(2^k)$ nodes per variable. A cutting-plane algorithm may begin by projecting onto a small graph,
then expanding to larger graphs only after satisfying all inequalities given by the smaller one. The
k-projection graph $G_k = (V_k, E_k)$ has $k$ partitions per variable corresponding to each value versus
all the other values.
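The node sets of the full and k-projection graphs can be enumerated directly. In the sketch below (function names are illustrative assumptions) a partition is represented as a tuple of 0/1 labels, one per value. Fixing the label of value 0 avoids listing a split and its complement twice, which gives $2^{k-1} - 1$ splits per variable, consistent with the $O(2^k)$ count above.

```python
from itertools import combinations

def full_partitions(k):
    """All splits of {0, ..., k-1} into two non-empty sets, as 0/1 label tuples.

    Value 0 is always labelled 0, so each unordered split appears exactly once.
    """
    out = []
    for size in range(1, k):
        for ones in combinations(range(1, k), size):
            out.append(tuple(1 if s in ones else 0 for s in range(k)))
    return out

def k_partitions(k):
    """One split per value: that value (label 1) versus all the others."""
    return [tuple(1 if s == v else 0 for s in range(k)) for v in range(k)]
```

For $k = 3$ and $k = 4$ the full enumeration yields 3 and 7 partitions respectively, matching the counts in the caption of Figure 1.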


These projections yield a new class of cycle inequalities for the marginal polytope. Consider a single
projection graph $G_\pi$, a cycle $C$ in $G$, and any $F \subseteq C$ such that $|F|$ is odd. Let $\pi_i$ be the partition
for node $i$. We obtain the following valid inequality for $\mu \in \mathcal{M}$ by applying the projection $\Psi_\pi$ and
the cycle inequality:

$$\sum_{(i,j) \in C \setminus F} \mu^\pi_{ij}(x_i \ne x_j) + \sum_{(i,j) \in F} \mu^\pi_{ij}(x_i = x_j) \ge 1, \qquad (5)$$

where

$$\mu^\pi_{ij}(x_i \ne x_j) = \sum_{s_i \in \chi_i, s_j \in \chi_j \text{ s.t. } \pi_i(s_i) \ne \pi_j(s_j)} \mu_{ij;s_i s_j} \qquad (6)$$

$$\mu^\pi_{ij}(x_i = x_j) = \sum_{s_i \in \chi_i, s_j \in \chi_j \text{ s.t. } \pi_i(s_i) = \pi_j(s_j)} \mu_{ij;s_i s_j}. \qquad (7)$$

It is revealing to contrast (5) with $\sum_{(i,j) \in C \setminus F} \delta(x_i \ne x_j) + \sum_{(i,j) \in F} \delta(x_i = x_j) \ge 1$. For $x \in \chi^n$,
the latter holds only for $|F| = 1$. We can only obtain the more general inequality by fixing a partition
of each node's values.
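Evaluating the left-hand side of (5) for a given cycle, edge set $F$ and choice of partitions only requires the aggregated sums (6) and (7). The following is a minimal sketch; the function name and argument layout are assumptions for illustration.

```python
def projected_cycle_lhs(cycle, F, mu_edge, pi):
    """Left-hand side of inequality (5).

    cycle: list of edges (i, j) forming a cycle in G.
    F: subset of those edges, of odd size.
    mu_edge[(i, j)][s][t] is mu_{ij;st}.
    pi[i][s] in {0, 1} is the partition chosen for variable i.
    """
    if len(F) % 2 != 1:
        raise ValueError("F must contain an odd number of edges")
    total = 0.0
    for (i, j) in cycle:
        table = mu_edge[(i, j)]
        # equation (7): mass on value pairs that fall in the same cluster
        same = sum(
            table[s][t]
            for s in range(len(table))
            for t in range(len(table[0]))
            if pi[i][s] == pi[j][t]
        )
        # equation (6): the remaining mass lies on pairs in different clusters
        differ = sum(map(sum, table)) - same
        total += same if (i, j) in F else differ
    return total
```

A value below 1 means the pseudomarginals violate that projected cycle inequality. For any extreme point of the marginal polytope the value is at least 1.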


Theorem 2. For every single projection graph $G_\pi$ and every cycle inequality arising from a
chordless circuit $C$ on $G_\pi$, $\exists \mu \in \mathrm{LOCAL}(G) \setminus \mathcal{M}$ such that $\mu$ violates that inequality.


Proof. For each variable $i \in V$, choose $s_i, t_i$ s.t. $\pi_i(s_i) = 1$ and $\pi_i(t_i) = 0$. Assign $\mu_{i;q} = 0$
for $q \in \chi_i \setminus \{s_i, t_i\}$. Similarly, for every $(i, j) \in E$, assign $\mu_{ij;qr} = 0$ for $q \in \chi_i \setminus \{s_i, t_i\}$ and
$r \in \chi_j \setminus \{s_j, t_j\}$. The polytope resulting from the projection of $\mathcal{M}$ onto the remaining values (e.g.
$\mu_{i;s_i}$) is isomorphic to $\mathcal{M}_{\{0,1\}}$ for the graph $G_\pi$. Barahona and Mahjoub [4] showed that the cycle
inequality on the chordless circuit $C$ is facet-defining for $\mathcal{M}_{\{0,1\}}$. Since $C$ is over $\ge 3$ variables from
$G$, this cannot be a facet of LOCAL(G). Let $\mathrm{LOCAL}_{\{0,1\}}$ be the projection of LOCAL(G) onto the
remaining values. Thus, $\exists \mu' \in \mathrm{LOCAL}_{\{0,1\}} \setminus \mathcal{M}_{\{0,1\}}$, and we can assign $\mu$ accordingly. □


Note  that  the  theorem  implies  that  the  projected  cycle  inequalities  are  strictly  tighter  than 
LOCAL(G),  but  it  does  not  characterize  how  much  is  gained. 

If all $n$ variables have $k$ values, then there are $O((2^k)^n)$ different single projection graphs. However,
since for every cycle inequality in the single projection graphs there is an equivalent cycle inequality
in the full projection graph, it suffices to consider just the full projection graph. Thus, even though
the projection is not surjective, the full projection graph, which has $O(n 2^k)$ nodes, allows us to
efficiently obtain a tighter relaxation than any combination of projection graphs would give. In
particular, the separation problem for all cycle inequalities (5) for all single projection graphs, when
we allow some additional valid inequalities for $\mathcal{M}$ (arising from the cycle using more than one
partition for some variables), can now be solved in time $O(\mathrm{poly}(n, 2^k))$.

Related work. In earlier work, Althaus et al. [1] analyze the GMEC polyhedron, which is equivalent
to the marginal polytope. They use a similar value-aggregation technique to derive valid constraints
from the triangle inequalities. Koster et al. [8] investigate the Partial Constraint Satisfaction
Problem polytope, which is also equivalent to the marginal polytope. They used value-aggregation to
show that a class of cycle inequalities (corresponding to Eq. 5 for $|F| = 1$) are valid for this
polytope, and give an algorithm to separate the inequalities for a single cycle. Interestingly, both papers
showed that these constraints are facet-defining.

Non-pairwise  Markov  random  fields.  These  results  could  be  applied  to  non-pairwise  MRFs  by 
first  projecting  the  marginal  vector  onto  the  marginal  polytope  of  a  pairwise  MRF.  More  generally, 
suppose we include additional variables corresponding to the joint probability of a cluster of
variables. We need to add constraints enforcing that all variables in common between two clusters have
the  same  marginals.  For  pairwise  clusters  these  are  simply  the  usual  local  consistency  constraints. 
We  can  now  apply  the  projections  of  the  previous  section,  considering  various  partitions  of  each 
cluster  variable,  to  obtain  a  tighter  relaxation  of  the  marginal  polytope. 
Figure 2: Accuracy of single node marginals on 10 node complete graph (100 trials).


5  Experiments 

Computing marginals. We experimented with Algorithm 1 using both the log-determinant [12] and
the TRW [10] entropy approximations. These trials are on Ising models, which are pairwise MRFs
with $x_i \in \{-1, 1\}$ and potentials $\phi_i(x) = x_i$ for $i \in V$ and $\phi_{ij}(x) = x_i x_j$ for $(i, j) \in E$. Although
TRW can efficiently optimize over the spanning tree polytope, for these experiments we simply use a
weighted distribution over spanning trees, where each tree's weight is the sum of the absolute value
of its edge weights $\theta_{ij}$. The edge appearance probabilities for this distribution can be efficiently
computed using the Matrix Tree Theorem [13]. We optimize the TRW objective with conditional
gradient, using linear programming after each gradient step to project onto OUTER. We used the
glpkmex and YALMIP optimization packages within Matlab, and wrote the separation algorithm
for the cycle inequalities in Java.
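To illustrate how the Matrix Tree Theorem gives edge appearance probabilities, the sketch below treats the simpler case of a uniform distribution over spanning trees. This is an assumption made for illustration and is not the weighted distribution used in the experiments. The number of spanning trees is the determinant of the graph Laplacian with one row and column removed, and an edge's appearance probability is one minus the fraction of spanning trees that avoid it. Exact arithmetic with `fractions.Fraction` keeps the example free of rounding issues.

```python
from fractions import Fraction

def count_spanning_trees(n, edges):
    """Number of spanning trees of a simple graph via the Matrix Tree Theorem."""
    L = [[Fraction(0)] * n for _ in range(n)]
    for i, j in edges:
        L[i][i] += 1
        L[j][j] += 1
        L[i][j] -= 1
        L[j][i] -= 1
    # delete the last row and column, then take the determinant
    M = [row[: n - 1] for row in L[: n - 1]]
    det = Fraction(1)
    for c in range(n - 1):
        pivot = next((r for r in range(c, n - 1) if M[r][c] != 0), None)
        if pivot is None:
            return 0  # singular: the graph is disconnected
        if pivot != c:
            M[c], M[pivot] = M[pivot], M[c]
            det = -det
        det *= M[c][c]
        for r in range(c + 1, n - 1):
            f = M[r][c] / M[c][c]
            for cc in range(c, n - 1):
                M[r][cc] -= f * M[c][cc]
    return int(det)

def edge_appearance_probabilities(n, edges):
    """P(edge in tree) under a uniform distribution over spanning trees."""
    total = count_spanning_trees(n, edges)
    return {
        e: Fraction(total - count_spanning_trees(n, [f for f in edges if f != e]), total)
        for e in edges
    }
```

On the complete graph with four nodes there are 16 spanning trees and, by symmetry, every edge appears with probability 1/2; the probabilities always sum to $n - 1$ because each spanning tree has $n - 1$ edges.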

In Figure 2 we show results for 10 node complete graphs with $\theta_i \sim U[-1, 1]$ and $\theta_{ij} \sim U[-x, x]$,
where the coupling strength is varied along the x-axis of the figure. For each data point we averaged
the results over 100 trials. The y-axis shows the average $\ell_1$ error of the single node marginals. These
MRFs are highly coupled, and loopy belief propagation (not shown) with a .5 decay rate seldom
converges. The TRW and log-determinant algorithms, optimizing over the local consistency polytope,
give pseudomarginals only slightly better than loopy BP. Even adding the positive semi-definite
constraint $M_1(\mu) \succeq 0$, for which TRW must be optimized using conditional gradient and
semi-definite programming for the projection step, does not improve the accuracy by much. However,
both entropy approximations give significantly better pseudomarginals when used by our algorithm
together with the cycle inequalities (see "TRW + Cycle" and "Logdet + Cycle" in the figures). For
small MRFs, we can exactly represent the marginal polytope as the convex hull of its $2^n$ vertices.
We found that the cycle inequalities give nearly as good accuracy as the exact marginal polytope
(see "TRW + Marg" and "Logdet + Marg").

Our work sheds some light on the relative value of the entropy approximation compared to the
relaxation of the marginal polytope. When the MRF is weakly coupled, both entropy approximations
do reasonably well using the local consistency polytope. This is not surprising: the limit of weak
coupling is a fully disconnected graph, for which both the entropy approximation and the marginal
polytope relaxation are exact. With the local consistency polytope, both entropy approximations
get steadily worse as the coupling increases. In contrast, using the exact marginal polytope, we
see a peak at coupling strength $\theta = 2$, then a steady improvement in accuracy as the coupling term grows. This
occurs because the limit of strong coupling is the MAP problem, for which using the exact marginal
polytope will give exact results. The interesting region is near the peak $\theta = 2$, where the entropy
term is neither exact nor outweighed by the coupling. Our algorithm seems to "solve" the part of
the problem caused by the local consistency polytope relaxation: TRW's error goes from .33 to
.15, and log-determinant's error from .17 to .076. The fact that neither entropy approximation
can achieve an error below .07, even with the exact marginal polytope, motivates further research
on improving this part of the approximation.
Figure 3: Accuracy of single node marginals with TRW entropy, $\theta_i \in U[-1, 1]$ and $\theta_{ij} \in U[-4, 4]$.

Figure 4: MAP for protein side-chain prediction with Rosetta energy function.


Next, we looked at the number of iterations (in terms of the loop in Algorithm 1) the algorithm takes
before all cycle inequalities are satisfied. In each iteration we add to OUTER at most^2 $n$ violated
cycle inequalities, coming from the $n$ shortest paths. In Figure 3 we show boxplots of the $\ell_1$ error
of the single node marginals for both 10x10 grid MRFs (40 trials) and 20 node complete MRFs (10
trials).  We  also  show  whether  the  pseudomarginals  are  on  the  correct  side  of  .5,  which  is  important  if 
we  were  doing  prediction  based  on  the  results  from  approximate  inference.  The  middle  line  gives  the 
median,  the  boxes  show  the  upper  and  lower  quartiles,  and  the  whiskers  show  the  extent  of  the  data. 
Iteration  1  corresponds  to  TRW  with  only  the  local  consistency  constraints.  For  the  grid  MRFs,  all  of 
the  cycle  inequalities  were  satisfied  within  10  iterations.  We  observed  the  same  convergence  results 
on  a  30x30  grid,  although  we  could  not  assess  the  accuracy  due  to  the  difficulty  of  exact  marginals 
calculation.  For  the  complete  graph  MRFs,  the  algorithm  took  many  more  iterations  before  all  cycle 
inequalities  were  satisfied. 

Protein  side-chain  prediction.  We  next  applied  our  algorithm  to  the  problem  of  predicting  protein 
side-chain  configurations.  Given  the  3-dimensional  structure  of  a  protein’s  backbone,  the  task  is  to 
predict  the  relative  angle  of  each  amino  acid’s  side-chain.  The  angles  are  discretized  into  at  most 
45  values.  Yanover  et  al.  [14]  showed  that  minimization  of  the  Rosetta  energy  function  corresponds 
to finding the MAP assignment of a non-binary pairwise MRF. They also showed that the
tree-reweighted max-product algorithm [9] can be used to solve the LP relaxation given by LOCAL(G),
and  that  this  succeeds  in  finding  the  MAP  assignment  for  339  of  the  369  proteins  in  their  data  set. 
However,  the  optimal  solution  to  the  LP  relaxation  for  the  remaining  30  proteins,  arguably  the  most 
difficult  of  the  proteins,  is  fractional. 

Using the k-projection graph and projected cycle inequalities, we succeeded in finding the MAP
assignment for all proteins except for the protein '1rl6'. We show in Figure 4 the number of
cutting-plane iterations needed for each of the 30 proteins. In each iteration, we solve the LP relaxation,
and, if the solution is not integral, run the separation algorithm to find violated inequalities. For the
protein '1rl6', after 12 cutting-plane iterations, the solution was not integral, and we could not find
any violated cycle inequalities using the k-projection graph. We then tried using the full projection
graph, and found the MAP after just one (additional) iteration. Figure 4 shows one of the cycle
inequalities (5) in the full projection graph that was found to be violated. The cut edges indicate
the 3 edges in $F$. The violating $\mu$ had $\mu_{36;s} = .1667$ for $s \in \{0, 1, 2, 3, 4, 5\}$, $\mu_{38;6} = .3333$,
$\mu_{38;4} = .6667$, $\mu_{43;s} = .1667$ for $s \in \{1, 2, 4, 5\}$, $\mu_{43;3} = .3333$, and zero for all other values of
these variables. This example shows that the relaxation given by the full projection graph is strictly
tighter than that of the k-projection graph.


2  Many  fewer  inequalities  were  added,  since  not  all  cycles  in  G'  are  simple  cycles  in  G. 
The commercial linear programming solver CPLEX 10.0 solves each LP relaxation in under 75
seconds. Using simple heuristics, the separation algorithm runs in seconds, and we find each protein's
MAP  assignment  in  under  11.3  minutes.  Kingsford  et  al.  [7]  found,  and  we  also  observed,  that 
CPLEX’s  branch-and-cut  algorithm  for  solving  integer  linear  programs  also  works  well  for  these 
problems.  One  interesting  future  direction  would  be  to  combine  the  two  approaches,  using  our  new 
outer  bounds  within  the  branch-and-cut  scheme.  Our  results  show  that  the  new  outer  bounds  are 
powerful,  allowing  us  to  find  the  MAP  solution  for  all  of  the  MRFs,  and  suggesting  that  using  them 
will  also  lead  to  significantly  more  accurate  marginals  for  non-binary  MRFs. 

6  Conclusion 

The  facial  structure  of  the  cut  polytope,  equivalently,  the  binary  marginal  polytope,  has  been  well- 
studied  over  the  last  twenty  years.  The  cycle  inequalities  are  just  one  of  many  large  classes  of  valid 
inequalities  for  the  cut  polytope  for  which  efficient  separation  algorithms  are  known.  Our  theoretical 
results  can  be  used  to  derive  outer  bounds  for  the  marginal  polytope  from  any  of  the  valid  inequalities 
on  the  cut  polytope.  Our  approach  is  particularly  valuable  because  it  takes  advantage  of  the  sparsity 
of  the  graph,  and  only  uses  additional  constraints  when  they  are  guaranteed  to  affect  the  solution. 
An  interesting  open  problem  is  to  develop  new  message-passing  algorithms  which  can  incorporate 
cycle  and  other  inequalities,  to  efficiently  do  the  optimization  within  the  cutting-plane  algorithm. 